Title: Tech 101: Making Clang sufficiently dumb
Date: 2017-06-12
Modified: 2017-12-07
Category: Blog
Slug: tech-101-making-clang-sufficiently-dumb
Tags: tech-101, howto, c
Summary: How to avoid the Sufficiently Smart Compiler problem with Clang

The title might be baffling to some of you; if it is, I advise you read [this
nice piece][1] which explains the Sufficiently Smart Compiler problem and why
it's an issue far better than I ever could. If by the end of this, you're not
convinced of why this is a problem (and *especially* a problem for a language
like C), then feel free to ignore this whole article and go about your merry
way. However, if you, like me, find this concerning, here is a set of
instructions for avoiding this problem while using Clang. I will only
discuss C here; if you care about C++, write your own blog post.

As an aside: if your compiler of choice is GCC, the approach is a little bit
different. I've written [some POSIX shell scripts][2] to deal with this problem,
so you can just use those. I would appreciate feedback on them, though.

## The basics ##

Before I can fully launch into how you can avoid being sufficiently smart with
Clang, it's worth discussing how compilers work. We're going to discuss this in
three steps: firstly, compilers generally; secondly, C compilers (at least
conceptually); and lastly, the LLVM compiler stack specifically. If you already
understand how a compiler, or a C compiler, typically works, you shouldn't find 
this too hard; if not, you're in for a fun ride.

### Compiler basics ###

A *compiler* is nothing more than an automatic translation program, which turns
an *input* in one language (typically called the *source language*) into an
*output* in a different one (typically called the *target language*). In the 
case of C specifically: 

* The *input* is source code (in the form of one or more files)
* The *source language* is C (obviously)
* The *output* is a library or executable (depending on what we want)
* The *target language* is usually assembly language for whatever architecture
  we're currently on (x86\_64, ARM, or whatever)

When we talk about *languages*, we generally care about two things:

1. What the language looks like, and the rules of its formation (its *syntax*);
   and
1. What the language means (its *semantics*)

We can also, by extension, talk about syntax and semantics for a given *input*.
Any given input can be either *syntactically valid* in some language (meaning
that it follows the rules of how that language is formed), or *syntactically
invalid* (meaning that, as far as that language is concerned, it's gibberish).
If an input is syntactically valid in some language, it also has semantics in
that language (i.e. a meaning). Whether this is the meaning you *intended* is
another matter entirely, but the fact remains: syntactically invalid input
*cannot* mean anything, correct or not.

Now, when we do a translation of some input, provided that the input has any
meaning at all, we want the translation to have the same meaning. To use our
fancy vocabulary above, given syntactically valid input, a compiler must produce
syntactically valid output with the same semantics as the input; this
fundamentally means that compilation is a *syntactic* task. At the same time,
however, a compiler must ensure several things as part of its operation:

* That the input is syntactically valid in the source language (we can't
  translate gibberish)
* That the output is syntactically valid in the target language (non-gibberish
  never means the same as gibberish)
* The semantics of the output must be the same as the semantics of the input (or
  we're lying to our users)

It'd also be nice if we could ensure the following things:

* That the input has the semantics we intended
* That the output is as efficient as possible

Now obviously, these two are difficult if not impossible in the general case:
the first one would involve our compiler being literally able to read our minds
(at least), while the second is a major reason why the Sufficiently Smart
Compiler problem exists in the first place. In theory, these two are not
*strictly* necessary for a compiler to do its work, but in some limited cases,
we can still get some of those benefits.

Now that we've gone over *what* compilers are meant to do, let's look over *how*
they go about doing it, in a general sense. Traditionally, the work of a
compiler is divided into five 'phases' of operation:

1. Lexical analysis (or *lexing*) - taking input in (what we expect to be) the source
   language, and turning it into a stream of syntax pieces (or *tokens*); if we
   can't, reject and abort
2. Syntactic analysis (or *parsing*) - taking the stream of tokens and turning
   it into a tree representing the structure of the input (called an *abstract
   syntax tree* or *AST*); if we can't, reject and abort
3. Semantic analysis - taking an AST and checking that its semantics are
   remotely what we intended or make any reasonable sense; if we find something
   that doesn't, reject and abort
4. Optimization - taking the AST and turning it into an AST with the same
   semantics, but (ideally) one that can be transformed into more efficient code
5. Code generation - taking the AST and emitting syntactically valid output in
   the target language with the same semantics
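
In miniature, for the C fragment ``return 1 + 2;``, the first phases would
(conceptually) produce something like this:

```
Phase 1 (lexing):   `return 1 + 2;` becomes the token stream
                    KEYWORD(return) CONSTANT(1) PLUS CONSTANT(2) SEMICOLON

Phase 2 (parsing):  the tokens become a tree
                    return-statement
                    └── addition
                        ├── constant 1
                        └── constant 2

Phase 4 (optimization) might fold the addition into `return 3;` -
same semantics, less work at runtime.
```

If you want to see the real thing, ``clang -fsyntax-only -Xclang -ast-dump``
will print the AST Clang builds for a given file.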

Now, we can clearly see that the first two phases are about syntax, or rather,
making sure that what we've been given is indeed a syntactically valid input in
the source language. We need to do this because humans are fallible, and code is
extremely specific in its meaning. Once we get past step 3 (the only
place where we directly consider semantics), we are no longer dealing with the
source language at all, because the AST represents the semantics that we're
interested in, in a way that's not dependent on any specific source language.
Code generation doesn't really need the thorough checking we do on the input
precisely for this reason - we essentially have 'pure semantics', with no
possibility of confusion or bad inputs.

Typically, the first two phases are called the 'front end', and the last two are
called the 'back end'. Semantic analysis is a bit of an odd one, and can
potentially belong in either camp. This separation is not coincidental; by
keeping these two separate, we gain multiple benefits: the back end doesn't have
to be concerned about the source language; the front end doesn't have to care
about the target language; we can port any source language we have a front end
for to a new target language by just changing the back end; we can compile many 
different source languages into the same target language by just changing the
front end; and so on. Thus, to nobody's great surprise, most compilers these
days take exactly this approach.

### Compiling C ###

The C language in particular adds a few wrinkles to the description above,
while still being broadly the same. More precisely, C has three additional
concepts: *preprocessing*, *separate compilation* and *translation units*, and
*linking*, all of which we will describe (again, in a general way) next. Of
these, preprocessing is the simplest: it is essentially a simple copy-paste
operation performed on C source files before any of the code is fed to the
compiler proper.

With respect to separate compilation, C allows you to divide
up your program into *translation units*, which you can think of as C source
files. Each of these is compiled separately, and only combined together at the
end. This doesn't really change the five-phase description above; it just means
that the whole process runs once per translation unit, rather than once over
the entire program. It also means each translation unit undergoes each phase
separately - in fact, this is one reason why header files are necessary:
without them, we wouldn't easily have enough information to perform many of the
required checks. It also limits some of the work of the optimization phase,
which only has per-unit information to go on (although in practice, all modern
compilers have ways around this problem).

Now, once we've compiled all the translation units (separately), we need some
way to combine them into a single output (the exact nature of which might vary
depending on what we want). This step, called *linking*, has several
responsibilities:

* If a translation unit calls a function in another translation unit, these
  calls must be routed appropriately
* If we use any libraries, we have to make sure that they're either included
  into our output, or set up some kind of runtime resolution
* Depending on what output we want, set up the appropriate scaffolding (such as
  an entry point for an executable)

It's worth noting that, at this point, all the code has already been compiled -
when we link, no more compilation takes place. In particular, it means that the
*linker* (the program which does all the above) isn't aware of anything to do
with the source language anymore; it operates entirely on the target language.
Now, modern compilers don't really work exactly this way, but once again,
keeping this mental model in mind helps in understanding much of what's to come
(and a bunch of other things in C besides, which were designed with exactly
this model in mind).

Thus, in short, your C codebase undergoes the following steps:

1. For each translation unit, apply whatever copy-paste operations required by
   their preprocessor directives
2. Compile each translation unit, using the five-phase model described
   previously
3. Link everything together

### How Clang and LLVM do it ###

Now that all that is over, we can talk about how Clang and LLVM do this.
Essentially, LLVM makes a slight change to the five-phase model above, and
adjusts the C compiler model a little to accommodate it. After C code passes
through its front end, Clang emits *LLVM bytecode* (today more often called
LLVM *bitcode* or IR), which can be thought of as assembly language
for some non-existent 'LLVM machine'. No optimization pass is done here, and the
actual target language is not generated: in a sense, LLVM bytecode is a
representation of the AST we described before. After this, *optimization passes*
are run on the LLVM bytecode, which serve the role of the optimization phase
described above; each pass does one kind of optimization, and emits optimized
LLVM bytecode. Then, we convert the LLVM bytecode into assembly language for the
actual target language (although we can execute LLVM bytecode directly if we
really want).

Thus, our model of a C compiler gets modified a little:

1. For each translation unit, apply whatever copy-paste operations are required
   by their preprocessor directives
2. For each translation unit, go through the front end; if all goes well, emit
   LLVM bytecode
3. For each translation unit, for each optimization pass, feed the LLVM bytecode
   through it and collect the output
4. For each translation unit, feed the LLVM bytecode through the code generator
   to emit the target language
5. Link everything together

This model is quite helpful, because thanks to the way Clang and LLVM are
designed, we can 'intercept' the results from any of these steps, and change
them to suit ourselves. In particular, step 3 can give us a lot of control by
simply choosing which optimization passes to run or not to run. Additionally, we
can inspect the LLVM bytecode at each stage to see what effect we're getting
(or not getting) and whether it's what we actually want. Considering that C
encourages small, tight translation units, this works in our favour: we can
easily check what our compiler and optimizer are emitting, and take appropriate
action if it's not what we expected. This is unlike the more monolithic model
of GCC, where such interception is basically not possible. With this, we can
address the problem of sufficient smartness appropriately.

## Making it all work ##

> "I get it Koz, but I didn't come here for a lecture. Show me how to fix the
> Sufficiently Smart Compiler problem already!"

Consider the (rather daft) C99 code below, in a file called ``main.c``:

    ::c
    #include <stdlib.h>

    static int loop (int x) {
      if (x) {
        loop(x - 1);
      }
      return x;
    }

    int main (void) {
      return loop(4);
    }

The smart cookies in the audience will recognize that the ``loop`` function
above can be trivially turned into a [proper tail call][3]. Let's set ourselves
up so that we can have this optimization, but not any others.

We first have to get Clang to spew out LLVM bytecode. We do this by invoking the
following command:

``clang -emit-llvm -c main.c``

Feel free to add whatever warning, standard etc. options you want here (which
you should always do anyway). If all goes well, this will do all the
preprocessing and front-end work, and dump an LLVM bytecode file (called
``main.bc``) in the same directory. This bytecode file is a compact binary
format, not meant to be read by people; if we want to inspect it, we need to
convert it into its human-readable textual form. To do that, we use
``llvm-dis`` as follows:

``llvm-dis main.bc``

This will emit a ``main.ll`` file, which we can have a look at with an ordinary
text editor (or ``less`` if you so prefer). If you can read assembly language
(of any sort), you'll feel right at home here. If not, don't worry - it's not
hard, and I won't make you read it this time. However, even a cursory
examination will show us that we don't have a proper tail call here, because of
this:

    ::llvm
    ; Function Attrs: noinline nounwind uwtable
    define internal i32 @loop(i32) #0 {
    ; some stuff we can live without
      br i1 %4, label %5, label %9

    ; <label>:5:                                      ; preds = %1
    ; more stuff that really doesn't matter
      %8 = call i32 @loop(i32 %7)
      br label %9

    ; <label>:9:                                      ; preds = %5, %1
      ret i32 0
    } 

Clearly, we end up calling ``loop`` over and over, which will pile up on the
call stack. Now, when we call ``loop`` with an argument of 4, this is a
non-issue, but if we were to call it with a bigger value, it could easily
overflow the call stack. So let's fix it!

The program LLVM uses to perform optimizations is ``opt``. It can perform a
large range of different optimizations - a full list is [here][4] - but the one
we're most interested in is [this one][5]:

> ``-tailcallelim``: Tail Call Elimination
> 
> This file transforms calls of the current function (self recursion) followed 
> by a return instruction with a branch to the entry of the function, creating 
> a loop.

To apply it, we need to invoke ``opt`` on our LLVM bytecode (that's the ``.bc``
file, *not* the ``.ll`` file), with ``-tailcallelim`` as a flag. By default, ``opt``
will dump the result to the standard output, so we need to tell it to put the
result in a file called ``main.opt.bc``:

``opt -tailcallelim -o=main.opt.bc main.bc``

Let's have a look at the result, after running it through ``llvm-dis``:

    ::llvm
    ; Function Attrs: noinline nounwind uwtable
    define internal i32 @loop(i32) #0 {
    ; some stuff we don't care about
      br label %tailrecurse

    tailrecurse:                                      ; preds = %5, %1
    ; more stuff that doesn't really matter
      br i1 %4, label %5, label %8

    ; <label>:5:                                      ; preds = %tailrecurse
      %6 = load i32, i32* %2, align 4
      %7 = sub nsw i32 %6, 1
      br label %tailrecurse

    ; <label>:8:                                      ; preds = %tailrecurse
      ret i32 0
    }

Much better! Since we now don't have calls piling up, our code will be as fast
as a loop. Because we're happy with what we have, it's time to turn our LLVM
bytecode into assembly language, using ``llc`` (which is the LLVM bytecode
'assembler'). We have to specify that we want an object file, rather than
textual assembly - if you want to see what it assembles into, change the
``filetype=`` option to ``asm``:

``llc -filetype=obj -o main.o main.opt.bc``

Now, we can link together our program. While we could do this using the linker
directly, it's more convenient for us to use ``clang`` again:

``clang -o main main.o``

And there we have it - one beautiful binary, with only those optimizations we
expressly asked for!

## What optimizations are 'safe'? ##

The number of different optimization passes ``opt`` can do on your code, and the
associated documentation, could choke a whale. While I can't really give better
advice than 'read the documentation and keep your problem in mind', the
following passes are pretty predictable and safe to apply 'by default': dead
code elimination (``-dce``), dead store elimination (``-dse``), and proper tail
calls (``-tailcallelim``). In the future, I might write some examples or
further material on this, so stay tuned!

> "But Koz, all those commands are a real pain! What could I do to make it hurt
> less?"

Glad you asked! Check out [tup][6]; you won't be sorry.

## Conclusion and going further ##

Whew, what a post! We somewhat-casually covered how compilers work in general,
the C compilation process, and some of the internal workings of Clang and LLVM,
as well as a way to avoid the Sufficiently Smart Compiler problem. In future
posts, we'll go into how we can do even better than this (this simple code
produces a nearly 10K binary, which is far too big).

[1]: http://prog21.dadgum.com/40.html
[2]: https://notabug.org/koz.ross/sscc
[3]: https://en.wikipedia.org/wiki/Tail-call_optimization
[4]: http://llvm.org/docs/Passes.html#transform-passes
[5]: http://llvm.org/docs/Passes.html#tailcallelim-tail-call-elimination
[6]: http://gittup.org/tup/index.html 
